Joint Event Detection and Description in Continuous Video Streams
Authors
Abstract
As a fine-grained video understanding task, dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers and proposes variable-length temporal events based on pooled features. In order to explicitly model temporal relationships between visual events and their captions in a single video, we propose a two-level hierarchical LSTM module that transcribes the event proposals into captions. Unlike existing dense video captioning approaches, our proposal generation and language captioning networks are trained end-to-end, allowing for improved temporal segmentation. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by most language generation metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
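The abstract describes a pipeline in which pooled features for each event proposal are transcribed by a two-level hierarchical LSTM: an event-level LSTM carries context across proposals in temporal order, and a word-level LSTM generates the caption for each proposal. The following is a minimal, untrained NumPy sketch of that hierarchical structure only; all dimensions, weight names, and the greedy decoding loop are illustrative assumptions, not the authors' JEDDi-Net implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W):
    """One LSTM step; W maps the concatenation [x; h] to the 4 gate pre-activations."""
    z = W @ np.concatenate([x, h])
    H = h.shape[0]
    i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Toy sizes (assumptions): pooled proposal feature dim, hidden dim, vocab, caption length.
FEAT, HID, VOCAB, MAX_LEN = 16, 8, 20, 5
W_ctrl = rng.normal(scale=0.1, size=(4 * HID, FEAT + HID))  # event-level LSTM weights
W_sent = rng.normal(scale=0.1, size=(4 * HID, HID + HID))   # word-level LSTM weights
W_out = rng.normal(scale=0.1, size=(VOCAB, HID))            # hidden -> vocab logits

def caption_events(event_feats):
    """Event-level LSTM runs over proposals in temporal order; its hidden state
    seeds a word-level LSTM that greedily emits one token id per step."""
    h_c, c_c = np.zeros(HID), np.zeros(HID)
    captions = []
    for feat in event_feats:
        h_c, c_c = lstm_step(feat, h_c, c_c, W_ctrl)  # update cross-event context
        h_s, c_s = h_c.copy(), np.zeros(HID)          # seed the word-level LSTM
        tokens, x = [], np.zeros(HID)                 # zero vector as start token (toy choice)
        for _ in range(MAX_LEN):
            h_s, c_s = lstm_step(x, h_s, c_s, W_sent)
            tokens.append(int(np.argmax(W_out @ h_s)))
            x = h_s                                   # feed hidden state back (toy choice)
        captions.append(tokens)
    return captions

events = [rng.normal(size=FEAT) for _ in range(3)]  # pooled features for 3 proposals
caps = caption_events(events)
```

Because the event-level state is threaded through all proposals, each caption can depend on the events that preceded it, which is the point of the hierarchy over captioning each proposal independently.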
Similar resources
Action Change Detection in Video Based on HOG
Background and Objectives: Action recognition, the process of labeling an unknown action in a query video, is a challenging problem due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. A number of solutions have been proposed to solve the action recognition problem. Many of these frameworks assume that each video sequence includes only one ...
Continuous Tracking Within and Across Camera Streams
This paper presents a new approach for continuous tracking of moving objects observed by multiple, heterogeneous cameras. Our approach simultaneously processes video streams from stationary and Pan-Tilt-Zoom cameras. The detection of moving objects from moving camera streams is performed by defining an adaptive background model that takes into account the camera motion approximated by an affine...
State Space Approaches for Modeling Activities in Video Streams
Title of dissertation: STATE SPACE APPROACHES FOR MODELING ACTIVITIES IN VIDEO STREAMS Naresh P. Cuntoor Doctor of Philosophy, 2006 Dissertation directed by: Professor Rama Chellappa Department of Electrical and Computer Engineering The objective is to discern events and behavior in activities using video sequences, which conform to common human experience. It has several applications such as r...
Joint processing of audio and visual information for multimedia indexing and human-computer interaction
Information fusion in the context of combining multiple streams of data (e.g., audio streams and video streams corresponding to the same perceptual process) is considered in a somewhat generalized setting. Specifically, we consider the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of descriptors, e.g., speech recognition/transcription,...
Recognizing Complex Events Using Large Margin Joint Low-Level Event Model
In this paper we address the challenging problem of complex event recognition by using low-level events. In this problem, each complex event is captured by a long video in which several low-level events happen. The dataset contains several videos, and due to the large number of videos and the complexity of the events, the available annotation for the low-level events is very noisy, which makes the de...
Journal: CoRR
Volume: abs/1802.10250
Pages: -
Publication date: 2018